Clemson HGIC Home & Garden Factsheet Scraper
Pricing
Pay per event
Developer
BowTiedRaccoon
Maintained by Community
5 days ago
Last modified
Scrapes the Clemson Home & Garden Information Center (HGIC) factsheet library — 2,500+ science-based factsheets covering plant care, diseases, pest management, lawn care, and food preservation. Outputs structured records with HGIC ID, body sections, symptoms, causal agent, management recommendations, recommended products, authors, and related factsheets.
What It Does
Clemson HGIC is one of the largest university extension factsheet libraries in the US, with 2,500+ documents focused on the southeastern US plant palette. Each factsheet follows a consistent template with discrete sections: symptoms, causal agent (pathogen or pest binomial), management and control recommendations, and prevention. This actor parses that structure into machine-readable fields that plant-diagnosis apps, AI garden assistants, and agronomy SaaS platforms can use as grounding data.
The actor reads the Yoast sitemap index to enumerate all factsheet URLs, then crawls each page using impit with Chrome TLS fingerprinting, so no proxy or CAPTCHA solver is required.
Use Cases
- Training data for plant disease diagnosis AI and AI garden assistant models
- Structured extension knowledge base for horticulture SaaS
- Agronomy/landscaping content and reference data pipelines
- Garden app content enrichment (symptom/treatment lookup)
Input
| Field | Type | Default | Description |
|---|---|---|---|
| maxItems | integer | 10 | Maximum number of factsheets to scrape. Set it above the corpus size (2,500+) to scrape every factsheet. |
Output
Each item represents one HGIC factsheet.
| Field | Type | Description |
|---|---|---|
| factsheet_id | string | HGIC factsheet number, e.g. HGIC 1223 |
| slug | string | URL slug, e.g. turfgrasses-for-the-carolinas |
| title | string | Factsheet title |
| category | string | Subject category: Diseases, Insects, Lawns, Soils, Vegetables, Trees & Shrubs, Flowers, Fruits & Nuts, Food Safety & Preservation, Human Health & Safety, General |
| plant_subjects | string | Comma-separated plant names from the title |
| problem_type | string | Problem type: disease, insect, cultural, or none |
| summary | string | First meaningful paragraph / introductory text |
| body_sections | string | JSON array of {heading, text} objects for the full structured body |
| symptoms | string | Symptom description text (for disease/pest/damage factsheets) |
| causal_agent | string | Pathogen or pest scientific/common name |
| management | string | Management and control recommendation text |
| prevention | string | Prevention and cultural practices text |
| recommended_products | string | Comma-separated trade names and chemistries found in management sections |
| related_factsheets | string | Comma-separated related factsheet links (`title |
| last_updated | string | Revision date as shown in factsheet metadata, e.g. Feb 28, 2016 |
| authors | string | Comma-separated list of factsheet authors |
| images | string | Comma-separated image URLs embedded in the factsheet |
| factsheet_url | string | Canonical URL of the factsheet |
| scrapedAt | string | ISO-8601 timestamp when the record was scraped |
Sample Output
    {
      "factsheet_id": "HGIC 1223",
      "slug": "turfgrasses-for-the-carolinas",
      "title": "Turfgrasses for the Carolinas",
      "category": "Lawns",
      "problem_type": "none",
      "summary": "For over 50 years the lawn has been an integral part of the landscape...",
      "body_sections": "[{\"heading\":\"Mowing\",\"text\":\"...\"}]",
      "last_updated": "Feb 28, 2016",
      "authors": "Millie Davenport, Gary Forrester",
      "factsheet_url": "https://hgic.clemson.edu/factsheet/turfgrasses-for-the-carolinas/"
    }
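Because body_sections is a JSON array serialized into a string and several fields are comma-separated strings, downstream code will usually want to decode a record before using it. The following is a minimal sketch of that step, not part of the actor itself. The helper name `decode_record` and the `LIST_FIELDS` tuple are hypothetical, and the naive comma split assumes individual values contain no commas, which may not hold for every product name or image URL.

```python
import json

# Fields the output table describes as comma-separated strings.
# related_factsheets is left out because its exact item format is not
# documented here.
LIST_FIELDS = ("plant_subjects", "recommended_products", "authors", "images")


def decode_record(record: dict) -> dict:
    """Return a copy with body_sections parsed and list-like fields split."""
    decoded = dict(record)
    # body_sections holds a JSON array of {heading, text} objects as a string.
    decoded["body_sections"] = json.loads(record.get("body_sections") or "[]")
    for field in LIST_FIELDS:
        raw = record.get(field) or ""
        # Naive split: assumes no value contains a comma itself.
        decoded[field] = [part.strip() for part in raw.split(",") if part.strip()]
    return decoded


# Subset of the sample record shown above.
sample = {
    "factsheet_id": "HGIC 1223",
    "title": "Turfgrasses for the Carolinas",
    "body_sections": "[{\"heading\":\"Mowing\",\"text\":\"...\"}]",
    "authors": "Millie Davenport, Gary Forrester",
}

decoded = decode_record(sample)
print(decoded["authors"])
print(decoded["body_sections"][0]["heading"])
```

Fields missing from a record, such as images in this sample, decode to an empty list rather than raising an error.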
Discovery Method
Reads the Yoast sitemap index at https://hgic.clemson.edu/sitemap.xml, filters for factsheet-sitemap.xml and factsheet-sitemap2.xml, and collects all /factsheet/<slug>/ URLs. The maxItems cap is applied before crawling begins.
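The discovery steps described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the actor's actual code: the function names, the `fetch` callable, the inline XML samples, and the `post-sitemap.xml` and `example-factsheet` entries are all hypothetical, and the sketch parses XML strings offline rather than making HTTP requests.

```python
import re
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
FACTSHEET_SITEMAP = re.compile(r"/factsheet-sitemap\d*\.xml$")
FACTSHEET_URL = re.compile(r"^https://hgic\.clemson\.edu/factsheet/[^/]+/$")


def child_sitemaps(index_xml: str) -> list:
    """Return the factsheet child sitemaps listed in a sitemap index."""
    root = ET.fromstring(index_xml)
    locs = [el.text.strip() for el in root.findall("sm:sitemap/sm:loc", SITEMAP_NS) if el.text]
    return [loc for loc in locs if FACTSHEET_SITEMAP.search(loc)]


def factsheet_urls(urlset_xml: str) -> list:
    """Return the /factsheet/<slug>/ URLs listed in one child sitemap."""
    root = ET.fromstring(urlset_xml)
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", SITEMAP_NS) if el.text]
    return [loc for loc in locs if FACTSHEET_URL.match(loc)]


def discover(index_xml: str, fetch, max_items: int) -> list:
    """Collect unique factsheet URLs, then apply the cap before any crawling."""
    urls = []
    for sitemap_url in child_sitemaps(index_xml):
        for url in factsheet_urls(fetch(sitemap_url)):
            if url not in urls:
                urls.append(url)
    return urls[:max_items]


# Hypothetical sample documents standing in for the real sitemap responses.
INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://hgic.clemson.edu/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://hgic.clemson.edu/factsheet-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://hgic.clemson.edu/factsheet-sitemap2.xml</loc></sitemap>
</sitemapindex>"""

PAGES = {
    "https://hgic.clemson.edu/factsheet-sitemap.xml": """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://hgic.clemson.edu/factsheet/</loc></url>
  <url><loc>https://hgic.clemson.edu/factsheet/turfgrasses-for-the-carolinas/</loc></url>
</urlset>""",
    "https://hgic.clemson.edu/factsheet-sitemap2.xml": """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://hgic.clemson.edu/factsheet/example-factsheet/</loc></url>
</urlset>""",
}

all_urls = discover(INDEX, PAGES.__getitem__, max_items=10)
capped = discover(INDEX, PAGES.__getitem__, max_items=1)
print(all_urls)
print(capped)
```

Passing `fetch` as a parameter keeps the parsing logic testable without network access; a real run would supply a function that downloads each child sitemap.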
Performance
- Memory: 128–256 MB
- Throughput: ~200 pages/minute at default concurrency (5)
- Full corpus (~2,500 factsheets): ~15–20 minutes
- Timeout: 2-hour default (sufficient for full corpus)
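As a rough sanity check on these figures, crawl time can be estimated by dividing the page count by the throughput. The sketch below uses a hypothetical helper, `estimated_crawl_minutes`, and covers page fetching only. The quoted 15–20 minutes is higher than this lower bound, presumably because a run also includes startup, sitemap discovery, and any retries; that explanation is an assumption, not something the figures above state.

```python
def estimated_crawl_minutes(pages: int, pages_per_minute: float = 200.0) -> float:
    """Lower-bound crawl time: page count divided by sustained throughput."""
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    return pages / pages_per_minute


full_corpus = estimated_crawl_minutes(2500)  # ~2,500 factsheets at ~200 pages/minute
default_run = estimated_crawl_minutes(10)    # default maxItems of 10
print(full_corpus)  # 12.5
print(default_run)  # 0.05
```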